Example procedure: multiple linear regression analysis of the abalone data
Data: https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/
Abalone Data (modified: some data are replaced by N/A):
Sex / nominal / -- / M, F, and I (infant)
Length / continuous / mm / Longest shell measurement
Diameter / continuous / mm / perpendicular to length
Height / continuous / mm / with meat in shell
Whole weight / continuous / grams / whole abalone
Shucked weight / continuous / grams / weight of meat
Viscera weight / continuous / grams / gut weight (after bleeding)
Shell weight / continuous / grams / after being dried
Rings / integer / -- / +1.5 gives the age in years
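As the last attribute notes, age in years is obtained by adding 1.5 to the ring count. A trivial check, using the ring count of the first row of the data:

```python
rings = 15            # ring count of the first abalone in the data
age_years = rings + 1.5
print(age_years)      # 16.5
```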
import sys
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
import statsmodels.api as sm
import statsmodels.formula.api as smf
my_libs_dir = '../'
if my_libs_dir not in sys.path:
    sys.path.append(my_libs_dir)  # add the path to the my_libs directory
# The following lines are needed to auto-reload my library file
# Without these lines, my library file is read only once and
# modifications of my library file are not reflected.
%load_ext autoreload
%autoreload 1
%aimport my_libs.linear_reg
# import from my library file
from my_libs.linear_reg import step_aic_forward, calc_vifs
%config InlineBackend.figure_formats = {'png', 'retina'} #high-reso images
plt.rcParams['font.family'] = 'Yu Mincho' # for Japanese in graph (Win10)
# To show all rows and columns in the results
pd.options.display.max_rows = 999
pd.options.display.max_columns = 999
Include, as exhaustively as possible, every factor that might affect the objective variable among the explanatory variables.
Check the CSV file before loading it, and rename the column labels if necessary.
If the character encoding of the CSV file is Shift-JIS, encoding='shift-jis' is needed in pd.read_csv().
csv_in = 'abalone_modified.csv'
df_all = pd.read_csv(csv_in, delimiter=',', skiprows=0, header=None)
# no header in csv, so set columns explicitly
df_all.columns = ['sex', 'len', 'd', 'h', 'w_all', 'w_meat', 'w_gut', 'w_shell', 'ring']
print(df_all.shape)
print(df_all.info())
display(df_all.head())
(4177, 9)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   sex      4177 non-null   object 
 1   len      4177 non-null   float64
 2   d        4176 non-null   float64
 3   h        4177 non-null   float64
 4   w_all    4175 non-null   float64
 5   w_meat   4177 non-null   float64
 6   w_gut    4177 non-null   float64
 7   w_shell  4177 non-null   float64
 8   ring     4177 non-null   int64  
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB
None
| | sex | len | d | h | w_all | w_meat | w_gut | w_shell | ring |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.5140 | 0.2245 | 0.1010 | 0.150 | 15 |
| 1 | M | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.070 | 7 |
| 2 | F | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.210 | 9 |
| 3 | M | 0.440 | NaN | 0.125 | 0.5160 | 0.2155 | 0.1140 | 0.155 | 10 |
| 4 | I | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.055 | 7 |
Inspect the numeric columns and the categorical columns
display(df_all.describe())
display(df_all.describe(exclude='number'))
| | len | d | h | w_all | w_meat | w_gut | w_shell | ring |
|---|---|---|---|---|---|---|---|---|
| count | 4177.000000 | 4176.000000 | 4177.000000 | 4175.000000 | 4177.000000 | 4177.000000 | 4177.000000 | 4177.000000 |
| mean | 0.523992 | 0.407892 | 0.139516 | 0.828871 | 0.359367 | 0.180594 | 0.238831 | 9.933684 |
| std | 0.120093 | 0.099250 | 0.041827 | 0.490450 | 0.221963 | 0.109614 | 0.139203 | 3.224169 |
| min | 0.075000 | 0.055000 | 0.000000 | 0.002000 | 0.001000 | 0.000500 | 0.001500 | 1.000000 |
| 25% | 0.450000 | 0.350000 | 0.115000 | 0.441750 | 0.186000 | 0.093500 | 0.130000 | 8.000000 |
| 50% | 0.545000 | 0.425000 | 0.140000 | 0.800000 | 0.336000 | 0.171000 | 0.234000 | 9.000000 |
| 75% | 0.615000 | 0.480000 | 0.165000 | 1.153500 | 0.502000 | 0.253000 | 0.329000 | 11.000000 |
| max | 0.815000 | 0.650000 | 1.130000 | 2.825500 | 1.488000 | 0.760000 | 1.005000 | 29.000000 |
| | sex |
|---|---|
| count | 4177 |
| unique | 3 |
| top | M |
| freq | 1528 |
Display the rows that contain missing values
display(df_all[df_all.isnull().any(axis=1)])
| | sex | len | d | h | w_all | w_meat | w_gut | w_shell | ring |
|---|---|---|---|---|---|---|---|---|---|
| 3 | M | 0.440 | NaN | 0.125 | 0.516 | 0.2155 | 0.1140 | 0.155 | 10 |
| 5 | I | 0.425 | 0.300 | 0.095 | NaN | 0.1410 | 0.0775 | 0.120 | 8 |
| 7 | F | 0.545 | 0.425 | 0.125 | NaN | 0.2940 | 0.1495 | 0.260 | 16 |
Delete the rows that contain missing values (or fill the missing values instead)
df_all = df_all.dropna().reset_index(drop=True)
#df_all = df_all.fillna(df_all.mean(numeric_only=True))  # fill missing values instead of deleting rows (numeric columns only)
print(df_all.shape)
display(df_all.head())
(4174, 9)
| | sex | len | d | h | w_all | w_meat | w_gut | w_shell | ring |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.5140 | 0.2245 | 0.1010 | 0.150 | 15 |
| 1 | M | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.070 | 7 |
| 2 | F | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.210 | 9 |
| 3 | I | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.055 | 7 |
| 4 | F | 0.530 | 0.415 | 0.150 | 0.7775 | 0.2370 | 0.1415 | 0.330 | 20 |
Separate the explanatory variables from the objective variable
X_all_org = df_all.loc[:, 'sex':'w_shell'] # explanatory variables
#X_all_org = df_all.drop(columns='ring') # alternative way
y = df_all['ring'] # objective variable
print('X_all_org:', X_all_org.shape)
display(X_all_org.head())
print('y:', y.shape)
print(y.head())
X_all_org: (4174, 8)
| | sex | len | d | h | w_all | w_meat | w_gut | w_shell |
|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.5140 | 0.2245 | 0.1010 | 0.150 |
| 1 | M | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.070 |
| 2 | F | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.210 |
| 3 | I | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.055 |
| 4 | F | 0.530 | 0.415 | 0.150 | 0.7775 | 0.2370 | 0.1415 | 0.330 |
y: (4174,)
0    15
1     7
2     9
3     7
4    20
Name: ring, dtype: int64
Convert the categorical variable into dummy variables
X_all = pd.get_dummies(X_all_org, drop_first=True)
print('X_all:', X_all.shape)
display(X_all.head())
X_all: (4174, 9)
| | len | d | h | w_all | w_meat | w_gut | w_shell | sex_I | sex_M |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.455 | 0.365 | 0.095 | 0.5140 | 0.2245 | 0.1010 | 0.150 | 0 | 1 |
| 1 | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.070 | 0 | 1 |
| 2 | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.210 | 0 | 0 |
| 3 | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.055 | 1 | 0 |
| 4 | 0.530 | 0.415 | 0.150 | 0.7775 | 0.2370 | 0.1415 | 0.330 | 0 | 0 |
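Note that recent pandas versions return boolean columns from `get_dummies`; passing `dtype=int` (an optional tweak, not part of the original cell) reproduces the 0/1 encoding shown above. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'sex': ['M', 'M', 'F', 'I']})
# drop_first=True drops the first category (sex_F); dtype=int forces 0/1 instead of True/False
dummies = pd.get_dummies(df, drop_first=True, dtype=int)
print(dummies.columns.tolist())  # ['sex_I', 'sex_M']
print(dummies)
```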
Draw a scatter matrix of all variable pairs. Also compute the correlation coefficients.
Pairwise Pearson correlation coefficients
corr_all = X_all.corr(method='pearson')
display(corr_all)
| | len | d | h | w_all | w_meat | w_gut | w_shell | sex_I | sex_M |
|---|---|---|---|---|---|---|---|---|---|
| len | 1.000000 | 0.986824 | 0.827555 | 0.925252 | 0.897911 | 0.903016 | 0.897679 | -0.551546 | 0.236762 |
| d | 0.986824 | 1.000000 | 0.833662 | 0.925445 | 0.893159 | 0.899723 | 0.905311 | -0.564281 | 0.240469 |
| h | 0.827555 | 0.833662 | 1.000000 | 0.819178 | 0.774907 | 0.798264 | 0.817327 | -0.518572 | 0.215422 |
| w_all | 0.925252 | 0.925445 | 0.819178 | 1.000000 | 0.969399 | 0.966367 | 0.955353 | -0.557675 | 0.252168 |
| w_meat | 0.897911 | 0.893159 | 0.774907 | 0.969399 | 1.000000 | 0.931939 | 0.882607 | -0.521938 | 0.251896 |
| w_gut | 0.903016 | 0.899723 | 0.798264 | 0.966367 | 0.931939 | 1.000000 | 0.907653 | -0.556193 | 0.242286 |
| w_shell | 0.897679 | 0.905311 | 0.817327 | 0.955353 | 0.882607 | 0.907653 | 1.000000 | -0.547004 | 0.235566 |
| sex_I | -0.551546 | -0.564281 | -0.518572 | -0.557675 | -0.521938 | -0.556193 | -0.547004 | 1.000000 | -0.522557 |
| sex_M | 0.236762 | 0.240469 | 0.215422 | 0.252168 | 0.251896 | 0.242286 | 0.235566 | -0.522557 | 1.000000 |
Output the pairs of explanatory variables whose correlation coefficients have large absolute values
Method 1: straightforward double loop
th_corr = 0.3
n_X = corr_all.shape[0]
corr_large = []
for i in range(n_X):
for j in range(i+1, n_X):
cc1 = corr_all.iat[i,j]
if cc1 < -th_corr or cc1 > th_corr:
corr_large.append([corr_all.columns[i], corr_all.columns[j], cc1])
corr_large.sort(reverse=True, key=lambda x: abs(x[2]))
display(corr_large)
[['len', 'd', 0.9868243090115055],
 ['w_all', 'w_meat', 0.9693985446065865],
 ['w_all', 'w_gut', 0.9663674812276402],
 ['w_all', 'w_shell', 0.9553531436656598],
 ['w_meat', 'w_gut', 0.9319388550966381],
 ['d', 'w_all', 0.9254445248747117],
 ['len', 'w_all', 0.9252519790478487],
 ['w_gut', 'w_shell', 0.9076528811707412],
 ['d', 'w_shell', 0.9053113238944874],
 ['len', 'w_gut', 0.9030160824486537],
 ['d', 'w_gut', 0.8997227868688794],
 ['len', 'w_meat', 0.8979107127746621],
 ['len', 'w_shell', 0.8976785093367889],
 ['d', 'w_meat', 0.8931586950571686],
 ['w_meat', 'w_shell', 0.8826068189019742],
 ['d', 'h', 0.8336617140875084],
 ['len', 'h', 0.8275548517614463],
 ['h', 'w_all', 0.8191784349998139],
 ['h', 'w_shell', 0.8173265128288625],
 ['h', 'w_gut', 0.7982642959196726],
 ['h', 'w_meat', 0.7749074469901602],
 ['d', 'sex_I', -0.5642807865477236],
 ['w_all', 'sex_I', -0.5576745603915512],
 ['w_gut', 'sex_I', -0.5561927240387998],
 ['len', 'sex_I', -0.5515457053873338],
 ['w_shell', 'sex_I', -0.5470038494408582],
 ['sex_I', 'sex_M', -0.5225569756516685],
 ['w_meat', 'sex_I', -0.5219383763221039],
 ['h', 'sex_I', -0.5185715607334639]]
Method 2: a more compact (but trickier) method using the upper triangle of the correlation matrix
th_corr = 0.3
keep = np.triu(np.ones(corr_all.shape), k=1).astype('bool').flatten()
#print(np.ones(corr_all.shape)) # debug
#print(np.triu(np.ones(corr_all.shape), k=1)) # debug
#print(np.triu(np.ones(corr_all.shape), k=1).astype('bool')) # debug
#print(keep) # debug
triu = corr_all.stack()[keep]
#print(corr_all.stack()) # debug
triu_sorted = triu[ np.abs(triu).sort_values(ascending=False).index ]
#print(np.abs(triu).sort_values(ascending=False)) # debug
#print(np.abs(triu).sort_values(ascending=False).index) # debug
print(triu_sorted[ (triu_sorted < -th_corr) | (triu_sorted > th_corr) ])
len d 0.986824
w_all w_meat 0.969399
w_gut 0.966367
w_shell 0.955353
w_meat w_gut 0.931939
d w_all 0.925445
len w_all 0.925252
w_gut w_shell 0.907653
d w_shell 0.905311
len w_gut 0.903016
d w_gut 0.899723
len w_meat 0.897911
w_shell 0.897679
d w_meat 0.893159
w_meat w_shell 0.882607
d h 0.833662
len h 0.827555
h w_all 0.819178
w_shell 0.817327
w_gut 0.798264
w_meat 0.774907
d sex_I -0.564281
w_all sex_I -0.557675
w_gut sex_I -0.556193
len sex_I -0.551546
w_shell sex_I -0.547004
sex_I sex_M -0.522557
w_meat sex_I -0.521938
h sex_I -0.518572
dtype: float64
Scatter matrix of all variable pairs
scatter_matrix(X_all, figsize=(30,30))
plt.show()
If you prefer seaborn:
sns.pairplot(X_all)
plt.show()
Drawing a heatmap is also an option
plt.figure(figsize=(10,10))
sns.heatmap(corr_all,annot=True,fmt='.2f',cmap='bwr')
plt.show()
Multiple linear regression with all explanatory variables, without standardization
X_all_c = sm.add_constant(X_all)
model = sm.OLS(y, X_all_c)
results = model.fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: ring R-squared: 0.538
Model: OLS Adj. R-squared: 0.537
Method: Least Squares F-statistic: 538.6
Date: Sun, 28 Mar 2021 Prob (F-statistic): 0.00
Time: 16:47:48 Log-Likelihood: -9196.8
No. Observations: 4174 AIC: 1.841e+04
Df Residuals: 4164 BIC: 1.848e+04
Df Model: 9
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 3.8927 0.292 13.352 0.000 3.321 4.464
len -0.4806 1.810 -0.266 0.791 -4.029 3.068
d 11.0755 2.228 4.971 0.000 6.707 15.444
h 10.8085 1.536 7.036 0.000 7.797 13.820
w_all 8.9725 0.725 12.371 0.000 7.551 10.394
w_meat -19.7779 0.817 -24.201 0.000 -21.380 -18.176
w_gut -10.5529 1.294 -8.158 0.000 -13.089 -8.017
w_shell 8.7291 1.125 7.762 0.000 6.524 10.934
sex_I -0.8202 0.102 -8.009 0.000 -1.021 -0.619
sex_M 0.0610 0.083 0.732 0.464 -0.102 0.224
==============================================================================
Omnibus: 948.350 Durbin-Watson: 1.438
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2721.642
Skew: 1.182 Prob(JB): 0.00
Kurtosis: 6.172 Cond. No. 137.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
#help(results)
#print(dir(results)) # To know all methods/attributes of an object
Check R² and adjusted R² to judge whether fitting a linear model is reasonable in the first place
print('R2:', results.rsquared)
print('Adj R2:', results.rsquared_adj)
R2: 0.5379311068528569
Adj R2: 0.5369323988705503
Rather good ...
Testing the regression equation (does the fitted MLR equation explain the objective variable?)
print('p-values (F-statistic)', results.f_pvalue)
p-values (F-statistic) 0.0
Very small p-value, so this MLR equation is considered to be significant.
Compare the coefficients of the explanatory variables:
standardize all explanatory variables and the objective variable, run the multiple linear regression again,
and compare the standardized partial regression coefficients.
# NOTE: after scaling, X_scaled and Y_scaled are ndarray, not DataFrame.
X_scaled = preprocessing.scale(X_all)
y_scaled = preprocessing.scale(y)
model = sm.OLS(y_scaled, X_scaled)
results = model.fit()
print(results.summary())
OLS Regression Results
=======================================================================================
Dep. Variable: y R-squared (uncentered): 0.538
Model: OLS Adj. R-squared (uncentered): 0.537
Method: Least Squares F-statistic: 538.8
Date: Sun, 28 Mar 2021 Prob (F-statistic): 0.00
Time: 16:47:48 Log-Likelihood: -4311.4
No. Observations: 4174 AIC: 8641.
Df Residuals: 4165 BIC: 8698.
Df Model: 9
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 -0.0179 0.067 -0.266 0.791 -0.150 0.114
x2 0.3410 0.069 4.971 0.000 0.207 0.475
x3 0.1403 0.020 7.037 0.000 0.101 0.179
x4 1.3651 0.110 12.372 0.000 1.149 1.581
x5 -1.3620 0.056 -24.204 0.000 -1.472 -1.252
x6 -0.3589 0.044 -8.159 0.000 -0.445 -0.273
x7 0.3770 0.049 7.763 0.000 0.282 0.472
x8 -0.1188 0.015 -8.010 0.000 -0.148 -0.090
x9 0.0091 0.012 0.732 0.464 -0.015 0.034
==============================================================================
Omnibus: 948.350 Durbin-Watson: 1.438
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2721.642
Skew: 1.182 Prob(JB): 0.00
Kurtosis: 6.172 Cond. No. 32.0
==============================================================================
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Variable selection
# NOTE: make DataFrames corresponding to X_scaled and y_scaled.
dfX_scaled = pd.DataFrame(X_scaled, columns=X_all.columns)
dfy_scaled = pd.Series(y_scaled, name=y.name)
exog = list(dfX_scaled.columns) # Initial set = all explanatory variables
endog = [dfy_scaled.name] # Objective variables
df_scaled = pd.concat([dfX_scaled, dfy_scaled], axis=1)
Variable selection by forward selection based on AIC
results_aic = step_aic_forward(smf.ols, exog, endog, data=df_scaled)
AIC: 11847.299, formula: ring ~ 1
AIC: 11034.346, formula: ring ~ w_meat
AIC: 10406.185, formula: ring ~ w_all
AIC: 10969.684, formula: ring ~ sex_I
AIC: 10293.403, formula: ring ~ h
AIC: 10175.294, formula: ring ~ d
AIC: 10625.364, formula: ring ~ w_gut
AIC: 11708.375, formula: ring ~ sex_M
AIC: 10299.876, formula: ring ~ len
AIC: 9758.050, formula: ring ~ w_shell
AIC: 9169.481, formula: ring ~ w_shell + w_meat
AIC: 9475.064, formula: ring ~ w_shell + w_all
AIC: 9675.054, formula: ring ~ w_shell + sex_I
AIC: 9718.311, formula: ring ~ w_shell + h
AIC: 9758.451, formula: ring ~ w_shell + d
AIC: 9587.967, formula: ring ~ w_shell + w_gut
AIC: 9751.448, formula: ring ~ w_shell + sex_M
AIC: 9758.475, formula: ring ~ w_shell + len
AIC: 9021.620, formula: ring ~ w_shell + w_meat + w_all
AIC: 9014.504, formula: ring ~ w_shell + w_meat + sex_I
AIC: 9021.593, formula: ring ~ w_shell + w_meat + h
AIC: 8940.150, formula: ring ~ w_shell + w_meat + d
AIC: 9157.809, formula: ring ~ w_shell + w_meat + w_gut
AIC: 9140.034, formula: ring ~ w_shell + w_meat + sex_M
AIC: 8989.039, formula: ring ~ w_shell + w_meat + len
AIC: 8834.347, formula: ring ~ w_shell + w_meat + d + w_all
AIC: 8834.317, formula: ring ~ w_shell + w_meat + d + sex_I
AIC: 8881.602, formula: ring ~ w_shell + w_meat + d + h
AIC: 8941.351, formula: ring ~ w_shell + w_meat + d + w_gut
AIC: 8912.940, formula: ring ~ w_shell + w_meat + d + sex_M
AIC: 8941.655, formula: ring ~ w_shell + w_meat + d + len
AIC: 8746.310, formula: ring ~ w_shell + w_meat + d + sex_I + w_all
AIC: 8786.637, formula: ring ~ w_shell + w_meat + d + sex_I + h
AIC: 8836.249, formula: ring ~ w_shell + w_meat + d + sex_I + w_gut
AIC: 8836.078, formula: ring ~ w_shell + w_meat + d + sex_I + sex_M
AIC: 8836.280, formula: ring ~ w_shell + w_meat + d + sex_I + len
AIC: 8705.152, formula: ring ~ w_shell + w_meat + d + sex_I + w_all + h
AIC: 8686.453, formula: ring ~ w_shell + w_meat + d + sex_I + w_all + w_gut
AIC: 8747.530, formula: ring ~ w_shell + w_meat + d + sex_I + w_all + sex_M
AIC: 8747.694, formula: ring ~ w_shell + w_meat + d + sex_I + w_all + len
AIC: 8639.403, formula: ring ~ w_shell + w_meat + d + sex_I + w_all + w_gut + h
AIC: 8688.133, formula: ring ~ w_shell + w_meat + d + sex_I + w_all + w_gut + sex_M
AIC: 8688.452, formula: ring ~ w_shell + w_meat + d + sex_I + w_all + w_gut + len
AIC: 8640.869, formula: ring ~ w_shell + w_meat + d + sex_I + w_all + w_gut + h + sex_M
AIC: 8641.336, formula: ring ~ w_shell + w_meat + d + sex_I + w_all + w_gut + h + len
The best formula: ring ~ w_shell + w_meat + d + sex_I + w_all + w_gut + h
Minimum AIC: 8639.403
print(results_aic.aic)
print(results_aic.model.exog_names)
print(results_aic.model.endog_names)
8639.403306647066
['Intercept', 'w_shell', 'w_meat', 'd', 'sex_I', 'w_all', 'w_gut', 'h']
ring
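The `step_aic_forward` function comes from the author's own `my_libs.linear_reg` module, whose source is not shown here. A minimal sketch of what an AIC-based forward-selection helper could look like (the name `step_aic_forward_sketch` and the exact stopping behavior are assumptions, not the original code):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def step_aic_forward_sketch(model_cls, exog, endog, data):
    """Greedy forward selection: repeatedly add the variable that lowers AIC most."""
    response = endog[0]
    selected, remaining = [], list(exog)
    best_aic = model_cls(f'{response} ~ 1', data=data).fit().aic  # null model
    while remaining:
        trials = []
        for cand in remaining:
            formula = f'{response} ~ ' + ' + '.join(selected + [cand])
            trials.append((model_cls(formula, data=data).fit().aic, cand))
        aic, best = min(trials)
        if aic >= best_aic:          # no candidate improves AIC: stop
            break
        best_aic = aic
        selected.append(best)
        remaining.remove(best)
    final = f'{response} ~ ' + (' + '.join(selected) if selected else '1')
    return model_cls(final, data=data).fit()

# tiny synthetic demo: y depends on x1 and x2 only
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({'x1': rng.normal(size=n),
                     'x2': rng.normal(size=n),
                     'x3': rng.normal(size=n)})
demo['y'] = 2 * demo['x1'] - demo['x2'] + rng.normal(scale=0.1, size=n)
res = step_aic_forward_sketch(smf.ols, ['x1', 'x2', 'x3'], ['y'], data=demo)
print(res.model.exog_names)
```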
Check for multicollinearity
Reference for statsmodels RegressionResults: https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.html
endogs = results_aic.model.endog_names
exogs = results_aic.model.exog_names.copy()
exogs.remove('Intercept')
print(exogs) # debug
X_c = sm.add_constant(X_all[exogs])
vifs = calc_vifs(X_c, y)
display(vifs)
['w_shell', 'w_meat', 'd', 'sex_I', 'w_all', 'w_gut', 'h']
| | VIF |
|---|---|
| const | 62.022437 |
| w_shell | 21.237845 |
| w_meat | 28.276245 |
| d | 8.337192 |
| sex_I | 1.512135 |
| w_all | 109.715404 |
| w_gut | 17.258319 |
| h | 3.575501 |
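`calc_vifs` is also from the author's `my_libs.linear_reg` module. A sketch of an equivalent helper built on statsmodels' `variance_inflation_factor` (the name `calc_vifs_sketch` and the one-column DataFrame layout are assumptions; note the original is called as `calc_vifs(X_c, y)`, but the VIF computation itself needs only X):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vifs_sketch(X_c):
    """VIF of every column of X_c; X_c should already contain the constant column."""
    vals = np.asarray(X_c, dtype=float)
    vifs = [variance_inflation_factor(vals, i) for i in range(vals.shape[1])]
    return pd.DataFrame({'VIF': vifs}, index=X_c.columns)

# tiny example: x2 is nearly a copy of x1, so both get a large VIF
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
X = pd.DataFrame({'const': 1.0,
                  'x1': x1,
                  'x2': x1 + 0.1 * rng.normal(size=100)})
vifs_demo = calc_vifs_sketch(X)
print(vifs_demo)
```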
How to eliminate multicollinearity depends on the case.
Here we simply delete the three explanatory variables with the highest VIFs.
exogs.remove('w_all')
exogs.remove('w_meat')
exogs.remove('w_shell')
print(exogs) # debug
X_c = sm.add_constant(X_all[exogs])
vifs = calc_vifs(X_c, y)
display(vifs)
['d', 'sex_I', 'w_gut', 'h']
| | VIF |
|---|---|
| const | 50.267875 |
| d | 6.606041 |
| sex_I | 1.504112 |
| w_gut | 5.543189 |
| h | 3.437739 |
For all explanatory variables, VIF < 10, so we can go forward.
The final standardized partial regression coefficients show the magnitude of each explanatory variable's influence on the objective variable
X_final_scaled = dfX_scaled[exogs]
model_final_scaled = sm.OLS(y_scaled, X_final_scaled)
results_final_scaled = model_final_scaled.fit()
print(results_final_scaled.summary())
OLS Regression Results
=======================================================================================
Dep. Variable: y R-squared (uncentered): 0.371
Model: OLS Adj. R-squared (uncentered): 0.370
Method: Least Squares F-statistic: 614.2
Date: Sun, 28 Mar 2021 Prob (F-statistic): 0.00
Time: 16:47:49 Log-Likelihood: -4955.9
No. Observations: 4174 AIC: 9920.
Df Residuals: 4170 BIC: 9945.
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
d 0.4256 0.032 13.479 0.000 0.364 0.487
sex_I -0.1577 0.015 -10.466 0.000 -0.187 -0.128
w_gut -0.1744 0.029 -6.031 0.000 -0.231 -0.118
h 0.2605 0.023 11.438 0.000 0.216 0.305
==============================================================================
Omnibus: 1320.783 Durbin-Watson: 1.085
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5933.437
Skew: 1.473 Prob(JB): 0.00
Kurtosis: 8.044 Cond. No. 5.68
==============================================================================
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
print(results_final_scaled.params)
d        0.425571
sex_I   -0.157683
w_gut   -0.174419
h        0.260505
dtype: float64
Order of the strength of influence of each explanatory variable on the objective variable ring:
d (positive) > h (positive) > w_gut (negative) > sex_I (negative)
Testing the regression equation (does the fitted MLR equation explain the objective variable?)
print('p-values (F-statistic)', results_final_scaled.f_pvalue)
p-values (F-statistic) 0.0
Very small p-value, so this MLR equation is considered to be significant.
Multiple linear regression with the selected explanatory variables, without standardization
X_final_c = sm.add_constant(X_all[exogs])
model_final = sm.OLS(y, X_final_c)
results_final = model_final.fit()
print(results_final.summary())
OLS Regression Results
==============================================================================
Dep. Variable: ring R-squared: 0.371
Model: OLS Adj. R-squared: 0.370
Method: Least Squares F-statistic: 614.1
Date: Sun, 28 Mar 2021 Prob (F-statistic): 0.00
Time: 16:47:49 Log-Likelihood: -9841.4
No. Observations: 4174 AIC: 1.969e+04
Df Residuals: 4169 BIC: 1.972e+04
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 2.7696 0.281 9.864 0.000 2.219 3.320
d 13.8221 1.026 13.477 0.000 11.811 15.833
sex_I -1.0885 0.104 -10.465 0.000 -1.292 -0.885
w_gut -5.1287 0.851 -6.030 0.000 -6.796 -3.461
h 20.0746 1.755 11.436 0.000 16.633 23.516
==============================================================================
Omnibus: 1320.783 Durbin-Watson: 1.085
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5933.437
Skew: 1.473 Prob(JB): 0.00
Kurtosis: 8.044 Cond. No. 53.3
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Check R² and adjusted R² to assess the goodness of fit of the linear model
print('R2:', results_final.rsquared)
print('Adj R2:', results_final.rsquared_adj)
R2: 0.3707362215803347
Adj R2: 0.37013246645592146
The fit of "the best model" is not good ...
The final partial regression coefficients give the increment of the objective variable when each explanatory variable increases by 1.
print(results_final.params)
const     2.769553
d        13.822066
sex_I    -1.088474
w_gut    -5.128734
h        20.074556
dtype: float64
Coefficients of the MLR model: each coefficient is the increment of the objective variable
when the corresponding explanatory variable increases by 1 while the other variables are held fixed.
ring $\approx$ 2.770 + 20.075 h + 13.822 d - 5.129 w_gut - 1.088 sex_I
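As an arithmetic check, plugging the fitted coefficients into this equation by hand for one sample (d=0.1, sex_I=0, w_gut=0.1, h=0.1, the same values used in the prediction example below) reproduces the model's prediction:

```python
# fitted coefficients from results_final.params
params = {'const': 2.769553, 'd': 13.822066, 'sex_I': -1.088474,
          'w_gut': -5.128734, 'h': 20.074556}
x = {'const': 1.0, 'd': 0.1, 'sex_I': 0.0, 'w_gut': 0.1, 'h': 0.1}
pred = sum(params[k] * x[k] for k in params)
print(round(pred, 6))  # 5.646342
```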
Use the obtained best model to make predictions.
NOTE: prepend a 1 for the constant term to each row of the X data used for prediction.
(You can use sm.add_constant() to add the 1s.)
dfX_test = pd.DataFrame([[1, 0.1, 0, 0.1, 0.1],
[1, 0.2, 1, 0.2, 0.2],
],
columns=results_final.params.index) # example
print('X for prediction:')
display(dfX_test)
y_test = results_final.predict(dfX_test)
print('Predicted y:')
print(y_test)
X for prediction:
| | const | d | sex_I | w_gut | h |
|---|---|---|---|---|---|
| 0 | 1 | 0.1 | 0 | 0.1 | 0.1 |
| 1 | 1 | 0.2 | 1 | 0.2 | 0.2 |
Predicted y:
0    5.646342
1    7.434656
dtype: float64